Progress in pre-clinical research is built on 
reproducible findings, yet reproducibility has 
different dimensions and even meanings. Indeed, 
the terms reproducibility, repeatability, and 
replicability are often used interchangeably, 
although each has a distinct definition. Moreover, 
reproducibility can be discussed at the level of 
methods, analysis, results, or conclusions (1, 2). 
Despite these differences in definitions and 
dimensions, the main aim for an individual 
research group is the ability to develop new 
studies and hypotheses based on firm and reliable 
findings from previous experiments. In practice, 
this aim is often difficult to achieve. In this 
review, issues affecting reproducibility in the field 
of mouse behavioral phenotyping are discussed. 


Crisis in reproducibility. Over the last ten years, 
the “reproducibility crisis” has often appeared in 
the headlines of scientific journals (3-6). Several 
factors have been identified as the major causes 
for irreproducibility, including p-hacking, 
cherry-picking, low statistical power, publication 
bias, and hypothesizing after results are known 
(7). However, these issues mostly occur after the 
animal experiments are done. There are many 
more items to consider during the planning and 
running of an experiment: good experimental 
design includes considerations regarding 
randomization, blinding, details of housing, 
husbandry and animal care, the definition of the 
experimental unit, inclusion and exclusion 
criteria, and the choice of animal subjects (the 
source, health status, strain, sex and age of the 
animal), among others (8-10). Guidelines and 
recommendations (e.g. ARRIVE, PREPARE) are 
available for addressing these factors (11, 12). 
However, although the ARRIVE guidelines have 
existed for ten years and are endorsed by more 
than 1000 journals so far, researchers’ awareness 
and the quality of publications have not improved 
sufficiently (13-15). To facilitate and enhance 
implementation of the ARRIVE guidelines, a 
revised version with detailed explanation and 
elaboration was recently published (16, 17). 
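
As an illustration of two of the design items listed above (randomization and blinding), the following is a minimal sketch and not a procedure prescribed by ARRIVE or PREPARE. The function name `allocate_blinded`, the group labels, the animal identifiers, and the seed are all hypothetical, and the sketch assumes that the individual animal is the experimental unit and that group sizes are equal.

```python
import random


def allocate_blinded(animal_ids, groups, seed):
    """Assign animals to groups in equal numbers and give each a blind code."""
    if len(animal_ids) % len(groups) != 0:
        raise ValueError("animals must divide evenly among groups")
    rng = random.Random(seed)  # fixed seed: the allocation can be audited later
    # Balanced list of group labels, shuffled to randomize the assignment.
    labels = list(groups) * (len(animal_ids) // len(groups))
    rng.shuffle(labels)
    # Blind codes are shuffled running numbers that carry no group information.
    codes = [f"B{n:03d}" for n in range(1, len(animal_ids) + 1)]
    rng.shuffle(codes)
    # Key (code -> animal and group) stays with someone not running the tests.
    key = {
        code: {"animal": animal, "group": label}
        for animal, label, code in zip(animal_ids, labels, codes)
    }
    # The experimenter's sheet lists only animal -> code, never the group.
    sheet = {entry["animal"]: code for code, entry in key.items()}
    return key, sheet


# Hypothetical example: 12 animals, two treatment groups.
animals = [f"mouse_{i:02d}" for i in range(1, 13)]
key, sheet = allocate_blinded(animals, ["vehicle", "treatment"], seed=2020)
```

Keeping the key separate from the experimenter’s sheet is the point of the sketch: the person scoring behavior works only with the codes, and the groups are revealed after data collection.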


Paradigm shift. Mice and rats are the most 
widely used model animals in basic biomedicine. 
However, there has been a drastic change in the 
relative use of these two rodent species over time 
(Figure 1). Historically, the rat was the model of 
choice for behavioral studies, but from the 
beginning of the 1990s a sharp shift from rats to mice 
took place. Obviously, this was due to rapid 
technological development in genetic engineering 
and the ability to create genetically modified mice 
(i.e. transgenic or targeted mutants). Initially these 
mice were only available in the most advanced 
laboratories (18, 19), but within ten years the use 
of genetically modified mouse models was 
widespread. Importantly, as these models became a routine 
tool for almost every team in biomedical research, 
there was likely a prevailing impression that 
behavioral assessment was the easiest part of the 
process in discovering the function(s) of each 
gene. Moreover, it was supposed that rat 
paradigms could be easily translated and applied 
to mice. Yet it quickly became clear that mice are 
not little rats, and extensive work with mice poses 
serious challenges (20-22). Another caveat is that 
while rat behavior in the laboratory has been 
studied for decades with the clear goal of 
understanding the mechanisms of behavior, the 
mouse is in the majority of cases studied only in 
the context of genetic modification (phenotyping 
the effects of gene targeting). This means that we 
may still be missing a lot of important basic 
information and knowledge about mouse behavior 
in laboratory conditions (note the trends in Figure 1). 


[Figure 1: three line plots of the number of publications (y-axis) per year, 1960-2020 (x-axis), for rat and mouse. Panels: a) Rat vs Mouse; b) Rat Behavior vs Mouse Behavior; c) "brain" AND "lesion".]


Figure 1. Simple PubMed search (accessed 5.6.2020) with 
keywords a) “rat” / “mouse” b) “rat and behavior” / “mouse 
and behavior” c) “rat and brain and lesion” / “mouse and 
brain and lesion”. 
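
A search of this kind can be repeated per publication year through the public NCBI E-utilities `esearch` endpoint. The sketch below only builds the request URL and is not the query actually used for Figure 1: the helper name `pubmed_count_url` and the exact search term are assumptions, and counts retrieved today will differ from those obtained in 2020.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_count_url(term, year):
    """Build an esearch URL for PubMed records matching `term` in one year."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",    # restrict by publication date
        "mindate": str(year),
        "maxdate": str(year),
        "retmax": "0",         # no record IDs needed, only the hit count
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical term; the wording of the original searches is not known.
url = pubmed_count_url("rat AND behavior", 1990)
```

Fetching such a URL for each year and each keyword pair, and reading the hit count from the response, would give one data point per year for curves like those in Figure 1.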


A mouse is not a mouse. The first mutant mice 
were made using embryonic stem cells from the 
129 mouse strain. However, it was known that 
these mice harbored several peculiarities 
complicating their use for neurobehavioral 
research, including poor breeding performance, 
hypo-activity, impaired learning, absent corpus 
callosum, and genetic contamination (23, 24). 
Therefore, the mice were crossed with another 
inbred strain, C57BL/6, which had been shown to be 
a suitable strain for various research topics, 
possessing intermediate phenotypes in many 
readouts that allowed identification of both gain- 
and loss-of-function (25). Subsequently, 
phenotypes of the mutant (knockout) and control 
(wild type) mice could be compared in the 
F2-hybrid generation. Yet the phenotypes of these 
F2 mice may be confounded by the presence of 
unusual background genes, especially by flanking 
(passenger) genes from the 129 strain (26, 27). 
Thus, the recommendation was to continue 
backcrossing with parental strains until congenic 
and co-isogenic lines were established (28). 
According to this recommendation, researchers 
would need to have two distinct genetic 
backgrounds in hand, with the possibility of making an 
F1-hybrid, which would allow for a 
powerful experimental design. However, in reality 
backcrossing has usually been done only to the 
C57BL/6J strain (considered a “gold standard” 
strain, after its genome was sequenced in 2001). 
In order to overcome the problems associated with 
a mixed genetic background and to facilitate the 
production of mutant mice, embryonic stem cell 
lines from the C57BL/6N strain were established 
(29). The International Mouse Phenotyping 
Consortium (IMPC) is currently creating mutant 
mice for large-scale phenotyping in a C57BL/6N 
background (30, 31). However, many researchers 
are unaware of the genetic and phenotypic 
differences between these two sub-strains 
(C57BL/6J and C57BL/6N), and the issue is 
further complicated by poor reporting of animal 
characteristics (the full and correct strain name is 
often missing) (32). Unfortunately, this is a major 
limitation for external validity and applicability of 
research with mutant mice. The use of inbred 
strains is well justified for reducing variability and 
increasing the precision of measurements 
(targeting genes in a known background). 
However, good design (and applicability) would 
require using more than one strain (33, 34) and 
there are numerous examples of phenotypic 
differences from the same mutation depending on 
the background strain (35). 


The phenotype is a result of gene-environment 
interplay. If a pre-clinical research group has 
created a mutant mouse line, then eventually the 
phenotypic characterization of live animals will 
be conducted. The early literature of such studies 
is full of controversies. Therefore, the behavioral 
neuroscience community had been aware 
of the problems with (ir)reproducibility well 
before the last decade, and in a way has been 
better prepared for the “crisis” (36-38). There has 
also been a major conflict between “molecular 
biologists” and “behaviorists” where the former 
could not understand or accept different results 
obtained by different laboratories. In order to 
tackle the discrepancies between laboratories, it 
was recommended to apply extensive 
standardization of procedures and environment. 
Such a solution was tested in a seminal study 
published in 1999 by Crabbe, Wahlsten and 
Dudek (39). They found that despite rigorous 
standardization of almost everything in three 
different laboratories, some of the results were 
idiosyncratic to a particular laboratory. Later, it 
was shown that among different factors 
contributing to the variability, the experimenter is 
the most prominent (40). However, there may be 
cases where standardization is required or desired 
— for instance, the IMPC has invested quite a bit 
in this type of effort, although the success has 
been variable (41). Exploring and tracking the 
causes for different outcomes between facilities 
may be a bumpy and painful process (42-44). All 
these findings further fueled the suspicions 
that behavioral studies are unreliable, an opinion 
especially expressed by researchers not working 
in the field of neurobehavioral research (43). 
However, an opposing theory was presented by 
Hanno Würbel, suggesting that extreme 
standardization is a cause rather than cure for poor 
reproducibility (45, 46). Moreover, in addition to 
the principle of 3Rs (47), researchers working 
with animal models should adopt thinking in 
terms of 3V’s — construct validity, internal 
validity and external validity (applicability, 
generalizability) (48). A comprehensive review of 
the current standing and future perspectives for 
embracing biological variation for enhancing 
reproducibility was recently published (49). 
Phenotyping efforts without considering the 
impact of environmental and developmental 
factors can be misleading. 


Core facilities. Nowadays, making a knock-out 
mouse is a routine and standard procedure. The 
real challenge is in comprehensive phenotyping 
(50). In 2000, Jacqueline Crawley published a 
book, “What’s Wrong With My Mouse” (second 
edition in 2007) (51), where she warned readers 
coming from molecular genetics that the aim was 
not to write a “how to” manual. Given that 
behavioral analysis is too complex to be treated as 
a “cookbook discipline”, “descriptions of the 
methods are intentionally superficial”, to give an 
overview of what is available. Inexperienced 
readers were advised to seek collaboration with 
experienced behavioral neuroscientists before 
setting up behavioral procedures. The real “pain 
and beauty” of mouse behavioral testing is 
comprehensively discussed in a book by Douglas 
Wahlsten (52). One solution for effectively 
tackling complexity in behavioral analysis has 
been the establishment of “core facilities” for 
behavioral assessment. By now, this is not 
surprising at all, because modern science is 
multidisciplinary, requiring special equipment 
(and more importantly, expertise) for adequately 
dealing with research questions. Thus, core 
facilities should help enhance the replicability and 
reliability of behavioral testing (53). The strengths 
and challenges of core facilities have been 
recently discussed (54, 55). 


The need for establishing a core facility must be 
based on demands from the research community, 
the potential users of the facility. In Helsinki, the 
Transgenic Unit was formed at the Laboratory 
Animal Center in 1996 by the Institute of 
Biotechnology. Two years later, an initiative for 
behavioral analysis of mutant mice was launched 
and I was recruited for this purpose. The 
laboratory in Helsinki was developed from the 
beginning with the idea of being open and 
offering as broad support as possible to everyone 
interested in behavioral phenotyping — testing of 
basic sensory and motor functions followed by 
more complex tasks for coping with stress, 
learning and memory, and testing approach / 
avoidance behavior (e.g. exploration-curiosity, 
fear-anxiety). 


However, as often happens, I started with an 
instance of failure, which taught me a lot. The 
first transgenic line to be tested did not seem to 
learn the spatial navigation task in the water maze (at 
that time considered a gold standard test for 
learning). I then found out that the mice were on an 
FVB/N background (the standard strain for transgenic 
mice at that time), which carries a mutation 
causing retinal degeneration and blindness (56). 
Curiously enough, there were papers published 
showing spatial learning in this strain (57). This 
was a lesson that warned me that meaningful 
work with mutant mice requires parallel studies 
with inbred strains — know thy mouse (58-60) and 
methods (54)! 


The end of the 1990s was a very active period in the 
field of behavioral phenotyping — a new opening 
and shifting of paradigms. We learned a lot about 
differences between mouse strains (25) and about 
strategies to set up test batteries (as opposed to the 
tradition of conducting only one test per animal) 
(61, 62). Excellent international training courses 
and workshops were organized and I personally 
had the opportunity to attend courses arranged by 
Cold Spring Harbor Laboratory, EMBO / FENS, 
IBRO, IBANGS, and the EUMORPHIA program. 
Inspiring interaction with fellow students and 
respected faculty (e.g. Jacqueline Crawley, 
Richard Paylor, Howard Eichenbaum, Seth Grant, 
Richard Morris, Hans-Peter Lipp, David Wolfer, 
Wim Crusio and many others) created a solid 
network and offered many good ideas for 
proceeding. My personal impression is that during 
the last 15 years there has been a decline in such 
high quality interactive courses. We have tried to 
fill this gap by organizing Baltic summer schools 
on the topic of rodent behavioral analysis (63). 
Indeed, learning, teaching and challenging 
existing paradigms are best achieved in 
communication and interaction between 
established and starting researchers; for 
impressions from recent FENS courses, see (64). 


Quality monitoring! The main purpose of the 
core facility is to serve the research community 
(55). The organization of the facility and 
collaboration with users can have different forms, 
from full to minimum service. The responsibility 
of the facility is to maintain and take good care of 
the equipment and space, including monitoring of 
performance, necessary calibrations, and timely 
repairs and replacements. An essential part of the 
responsibility is training users and supporting 
them in all steps of the project (planning, 
conducting, analyzing, and reporting). This is one 
part of quality. Another part, as mentioned above, 
is to be up-to-date with developments in theory 
and methodology in behavioral neuroscience and 
laboratory animal science. In addition, internal 
validity and consistency should be verified with 
some standards and calibrations for animal 
behavior. 


Users frequently ask what normal mouse behavior 
is, while thinking only in terms of their particular 
disease model. Moreover, it is often thought that 
the control, “wild-type” for gene-targeted mice, 
represents “normal”, immediately implying that 
gene targeting will result in “abnormal” animals. 
As described previously, it is difficult or 
impossible to answer the question of what 
“normal” behavior is given the many inbred 
strains available, along with the impact of 
environmental conditions. Each inbred strain has 
some peculiarities due to inbreeding — retinal 
degeneration, deafness, anatomical differences, 
susceptibility or resistance for certain conditions 
that can develop (e.g. diabetes) (65). We all may 
have heard the sayings “your genetics is only as 
good as your phenotype” and “this mutant does 
not have a phenotype” (66). First, these ideas may 
cause bias towards the hypothesis and second, 
having no phenotype is impossible — the only 
conclusion in this case would be that the given 
mice in the given situation did not display a 
phenotypic difference from the other study 
groups. Therefore, each model needs to be placed 
in the broader context of mouse behavior and with 
observation of the environment. Another question 
that may be asked is which is the best test for 
memory (or anxiety, or any other domain). 
Researchers with different backgrounds may not 
be aware of different memory systems or different 
types of anxiety. This is further complicated by 
the fact that there are hundreds of tests available 
(67). Navigating this landscape is one of the tasks 
of experts working in core facilities. Of course, 
the facility also learns from its users: a good 
facility is open and flexible in adapting and 
developing new methods. 


Quality monitoring can be done by regular testing 
of inbred strains with known phenotypes. 
Although we cannot speak of animals as tools, we 
hope that the phenotypes of inbred strains are 
stable over time (68). Therefore, the testing of 
such animals could reveal if the conditions in the 
laboratory are stable. The C57BL/6 and DBA/2 
strains are the oldest strains available and much 
information has been collected about physiology, 
anatomy and behavior in these mice. In our 
studies, we have found consistent differences 
between these two strains in several conventional 
tests (open field, light-dark box, and forced swim 
test) throughout the years when our laboratory 
was located in three different buildings (60, 
69-71). 
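
One way to make such monitoring explicit is to compute a standardized strain difference for every monitoring batch and check that its direction does not change. The sketch below is an assumption about how this could be done, not the procedure used in our laboratory; the function names, the generic strain labels, and all numbers are invented for illustration and are not measured data.

```python
from math import sqrt
from statistics import mean, stdev


def cohens_d(a, b):
    """Standardized mean difference between two groups (pooled SD)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def strain_difference_is_stable(batches):
    """Return (stable, effects): stable is True if the sign of the strain
    difference is the same in every monitoring batch."""
    effects = [cohens_d(b["strain_1"], b["strain_2"]) for b in batches]
    stable = all(e > 0 for e in effects) or all(e < 0 for e in effects)
    return stable, effects


# Invented scores for two reference strains in two monitoring batches.
batches = [
    {"strain_1": [10, 12, 11, 13], "strain_2": [7, 8, 6, 9]},
    {"strain_1": [11, 12, 10, 14], "strain_2": [8, 7, 9, 6]},
]
stable, effects = strain_difference_is_stable(batches)
```

A sign check is deliberately crude. A facility might also track the size of the effect over time, but what counts as an acceptable drift is a judgment that the sketch does not attempt to make.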


Another important issue that is frequently 
discussed is the use of male and/or female mice 
(72). Indeed, it may still be difficult to convince 
researchers that including female mice in a 
study does not ruin it, despite the evidence that 
female mice are no more variable than males (73, 
74). Including both sexes is mandatory for sound 
design and enhanced external validity (75). 
Despite many years of recommendations to 
consider sex as a biological variable, the change is 
taking place very slowly (76, 77). 


Last but not least, the human factor 
(experimenter) in animal experiments cannot be 
neglected (40, 78). Therefore, handling techniques 
need to be trained and refined (79) in addition to 
improving the understanding of the behavior that 
is measured and recorded, even if the process is 
automated (80, 81). Testing animal behavior can 
be challenging: “Despite our best efforts, the 


Summary. In this short review, I tried to 
highlight the factors that I consider important or 
essential in running a meaningful (mouse) 
behavioral phenotyping program. I would like to 
conclude with the words of Michael Festing and 
Ulrich Dirnagl: 


‘We are not born knowing how to design and 
analyze scientific experiments (85).’ 


‘We should be [moving to the world] where 
biological thinking rules, and sound data 
production is emphasized through careful 
planning, design, execution, and reporting of 
our studies. A world where methods and 
results are transparently described so that 
effects and inferences can be independently 
confirmed (86).’ 


Thus, I encourage everyone to be open to new 
ideas while being skeptical of the phrase “we’ve 
always done it like this”. 